[AutoScaler] implement prometheus metrics client #84

Merged
TomerShor merged 17 commits into v3io:development from weilerN:NUC-712-implement_prometheus_client
Jan 20, 2026

Conversation

@weilerN weilerN (Collaborator) commented Jan 12, 2026

📝 Description

This PR introduces a Prometheus-based metrics client to the autoscaler and refactors the metrics collection flow to support it.


🛠️ Changes Made

  • Added a new Prometheus client that implements the MetricsClient interface. Key functionality includes:
    • Query templates
    • Filtering for the relevant window sizes before querying the Prometheus server
    • Rounding up values so that any fractional Prometheus value > 0 becomes at least 1 (see the sketch below)
  • Refactored the autoscaler metrics flow to pass all nominated STZ resources instead of precomputed metric names.
  • Moved the metric-name extraction logic from the autoscaler into the Kubernetes custom-metrics client.
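
A minimal sketch of the rounding rule from the third sub-bullet, assuming math.Ceil semantics (the actual implementation may differ):

```go
package main

import (
	"fmt"
	"math"
)

// roundUpMetricValue rounds a raw Prometheus sample up to an integer, so any
// fractional value strictly greater than zero counts as at least 1.
func roundUpMetricValue(value float64) int {
	return int(math.Ceil(value))
}

func main() {
	fmt.Println(roundUpMetricValue(0.2)) // 1 – a trickle of traffic still counts
	fmt.Println(roundUpMetricValue(2.0)) // 2 – whole values are unchanged
}
```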

✅ Checklist

  • I have tested the changes in this PR

🧪 Testing

  • Added comprehensive unit tests for the Prometheus metrics client using a mock Prometheus server.
  • Dev tests include backward-compatibility checks

🔗 References


🚨 Breaking Changes?

  • Yes (explain below)
  • No

🔍️ Additional Notes

@weilerN weilerN force-pushed the NUC-712-implement_prometheus_client branch 4 times, most recently from cf7e195 to 7f0f482 on January 14, 2026 14:50
@weilerN weilerN force-pushed the NUC-712-implement_prometheus_client branch from 7f0f482 to 6482f50 on January 15, 2026 08:47
@weilerN weilerN requested review from TomerShor and rokatyy and removed request for rokatyy on January 15, 2026 14:59
@weilerN weilerN force-pushed the NUC-712-implement_prometheus_client branch from bd6ad0a to 3944b48 on January 15, 2026 15:11
@weilerN weilerN force-pushed the NUC-712-implement_prometheus_client branch from 2f52fc9 to 1b9b74b on January 15, 2026 15:17
@weilerN weilerN marked this pull request as ready for review January 15, 2026 15:54
@rokatyy rokatyy (Collaborator) left a comment

Nice!

Some suggestions and comments below. Also, please change the PR title :)

"github.com/prometheus/common/model"
)

type PrometheusClient struct {
Collaborator:

To align with K8sCustomMetricsClient:

Suggested change
- type PrometheusClient struct {
+ type PrometheusMetricsClient struct {

continue
}

resourceNameRegex := pc.buildResourceNameRegex(resources)
Collaborator:

No need to run it on every iteration.

"error", err)
continue
}

Collaborator:

This is not very good in terms of O() complexity.

resourceNameRegex := pc.buildResourceNameRegex(resources) iterates over resources on every iteration of queryTemplates for no reason.

Then the windowSizes loop redoes everything extractWindowSizesForMetric already does, with even worse complexity, and on every iteration we iterate over everything. Iterate over this list once and put the data into a map keyed by metric name; this way we can avoid the unnecessary iterations.
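
A rough sketch of the indexing this comment suggests, with illustrative types and field names (not the actual structs):

```go
// resourceInfo is a stand-in for the real scalertypes.Resource fields used here.
type resourceInfo struct {
	Name       string
	MetricName string
	WindowSize string
}

// indexResourcesByMetric walks the resource list once and groups resources by
// metric name and then by window size, so the query loops become map lookups
// instead of repeated scans over the full list.
func indexResourcesByMetric(resources []resourceInfo) map[string]map[string][]resourceInfo {
	index := make(map[string]map[string][]resourceInfo)
	for _, resource := range resources {
		if index[resource.MetricName] == nil {
			index[resource.MetricName] = make(map[string][]resourceInfo)
		}
		index[resource.MetricName][resource.WindowSize] = append(index[resource.MetricName][resource.WindowSize], resource)
	}
	return index
}
```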


func NewPrometheusClient(parentLogger logger.Logger, prometheusURL, namespace string, templates []scalertypes.QueryTemplate) (*PrometheusClient, error) {
if len(templates) == 0 {
return nil, errors.New("Failed to created Prometheus client: query template cannot be empty")
Collaborator:

And in all the places below:

Suggested change
- return nil, errors.New("Failed to created Prometheus client: query template cannot be empty")
+ return nil, errors.New("Failed to create Prometheus client: query template cannot be empty")

Comment on lines 242 to 245
labelNames := []model.LabelName{
	"function",     // For nuclio functions (nuclio_processor_handled_events_total) - maps to nucliofunction resource
	"service_name", // For deployments (num_of_requests, jupyter_kernel_busyness) - maps to deployment resource
	"pod",          // For pod-based metrics (DCGM_FI_DEV_GPU_UTIL) - maps to pod resource
Collaborator:

Constants?

return nil, errors.Wrapf(err, "Failed to render query for metricName=%s, windowSize=%s", metricName, windowSize)
}

rawResult, warnings, err := pc.apiClient.Query(context.Background(), query, time.Now())
@rokatyy rokatyy (Collaborator) commented Jan 15, 2026

Retry if failed?

Also, I'm not sure about recreating context.Background() every time, especially since it has no timeout (so no actual purpose).

I suggest creating one ctx object for the whole func and using a semaphore to send requests to Prometheus.
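
A minimal sketch of that suggestion, using a hypothetical runQuery helper (the real query call, timeout, and concurrency limit may differ):

```go
import (
	"context"
	"sync"
	"time"
)

// queryAll creates one context with a timeout for the whole function and uses a
// buffered channel as a semaphore to bound the number of concurrent queries.
func queryAll(queries []string, runQuery func(context.Context, string)) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	semaphore := make(chan struct{}, 4) // illustrative concurrency limit
	var waitGroup sync.WaitGroup

	for _, query := range queries {
		waitGroup.Add(1)
		go func(query string) {
			defer waitGroup.Done()
			semaphore <- struct{}{}        // acquire a slot
			defer func() { <-semaphore }() // release it when done
			runQuery(ctx, query)
		}(query)
	}
	waitGroup.Wait()
}
```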

@weilerN weilerN (Collaborator Author) commented Jan 19, 2026
I fixed the context per iteration here - CR 3 - parallel querying and use single context per iteration

Regarding retries, I don’t think they’re needed at this level. If a request fails, it will be tried again in the next iteration anyway. The existing Kubernetes custom-metrics client also doesn’t use retries, so adding them here would make this implementation behave differently. WDYT?

@rokatyy rokatyy (Collaborator) commented Jan 19, 2026
@weilerN what do you mean by:

"adding them here would make this implementation behave differently"?

How often do we usually run? I think this flow is relatively critical, as it scales things that might use GPUs, which affects the final cost of the used resources. So if something is temporarily wrong with the network, I think it's better to retry now than later, isn't it?

Collaborator Author (weilerN):

It runs by default every 1m (link).

I agree that adding a retry mechanism makes sense. That said, I think it's out of scope for this CR, so I suggest we track it separately by opening a tech-debt ticket to address retries in a more comprehensive way.

@weilerN weilerN changed the title Nuc 712 implement prometheus client [Scaler] implement prometheus client Jan 15, 2026
@weilerN weilerN changed the title [Scaler] implement prometheus client [AutoScaler] implement prometheus client Jan 15, 2026
@weilerN weilerN changed the title [AutoScaler] implement prometheus client [AutoScaler] implement prometheus metrics client Jan 15, 2026
@weilerN weilerN force-pushed the NUC-712-implement_prometheus_client branch from ac8f30b to 549c27e on January 19, 2026 07:38
@TomerShor TomerShor (Collaborator) left a comment
Looking much better!

Comment on lines 220 to 226
for resultChan != nil || errChan != nil {
	select {
	case result, ok := <-resultChan:
		if !ok {
			resultChan = nil
			continue
		}

Collaborator:

Isn't it the same as:

Suggested change
-for resultChan != nil || errChan != nil {
-	select {
-	case result, ok := <-resultChan:
-		if !ok {
-			resultChan = nil
-			continue
-		}
+for {
+	select {
+	case result, ok := <-resultChan:
+		if !ok {
+			return
+		}

Collaborator Author (weilerN):

It's not the same: in your case the goroutine will exit as soon as the first channel is closed, regardless of the second channel, and I want it to exit only after both channels are closed.
I agree that for { is better than for resultChan != nil || errChan != nil {, so I will change it and add comments.
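
For reference, a minimal sketch of the pattern being discussed (the handler names are illustrative): setting a closed channel to nil disables its case, because a receive on a nil channel blocks forever, so the loop ends only once both channels have been closed.

```go
for resultChan != nil || errChan != nil {
	select {
	case result, ok := <-resultChan:
		if !ok {
			resultChan = nil // resultChan is closed; stop selecting on it
			continue
		}
		handleResult(result) // hypothetical handler
	case err, ok := <-errChan:
		if !ok {
			errChan = nil // errChan is closed; stop selecting on it
			continue
		}
		handleError(err) // hypothetical handler
	}
}
```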

Comment on lines 236 to 240
	case err, ok := <-errChan:
		if !ok {
			errChan = nil
			continue
		}
Collaborator:

Same here.

Copilot AI left a comment
Pull request overview

This PR introduces a Prometheus-based metrics client implementation to support scaling decisions based on Prometheus metrics, refactoring the metrics collection architecture to be more flexible.

Changes:

  • Added a new PrometheusMetricsClient that implements the MetricsClient interface with support for query templates, window size filtering, and automatic rounding of fractional values
  • Refactored the MetricsClient interface to accept Resource objects instead of pre-computed metric names, moving metric name extraction logic into individual client implementations
  • Exported helper functions (ShortDurationString, GetKubernetesMetricName) to support both K8s and Prometheus clients

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 8 comments.

File | Description
pkg/scalertypes/types.go | Added QueryTemplate struct and methods, updated the MetricsClient interface signature, exported utility functions for metric name generation
pkg/autoscaler/metricsclient/prometheus.go | New Prometheus client implementation with query templating, concurrent metric fetching, and value rounding logic
pkg/autoscaler/metricsclient/prometheus_test.go | Comprehensive unit tests for the Prometheus client using a mock HTTP server
pkg/autoscaler/metricsclient/k8s.go | Updated the K8s custom metrics client to accept resources and extract metric names internally
pkg/autoscaler/metricsclient/factory.go | Added factory support for creating the Prometheus client
pkg/autoscaler/autoscaler.go | Simplified by removing the getMetricNames function and passing resources directly to the metrics client
go.mod, go.sum | Added Prometheus client dependencies and updated transitive dependencies
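
As a purely illustrative sketch of what the factory selection in factory.go might look like (the config struct and constructor signatures below are assumptions, not the actual code):

```go
// metricsClientConfig is a hypothetical stand-in for the real factory options.
type metricsClientConfig struct {
	Kind          string // e.g. "kubernetes" or "prometheus"
	PrometheusURL string
	Namespace     string
	Templates     []scalertypes.QueryTemplate
}

// newMetricsClient picks a MetricsClient implementation based on the configured kind.
func newMetricsClient(parentLogger logger.Logger, config metricsClientConfig) (MetricsClient, error) {
	switch config.Kind {
	case "prometheus":
		return NewPrometheusMetricsClient(parentLogger, config.PrometheusURL, config.Namespace, config.Templates)
	default:
		return NewK8sCustomMetricsClient(parentLogger, config.Namespace)
	}
}
```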


Comment on lines +48 to +51
type windowSizeLookup map[string]map[string]struct{}

// metricLookup maps metricName → windowSizeLookup
type metricLookup map[string]windowSizeLookup
Collaborator:

What is the point of adding these types instead of returning the native types directly?

Collaborator Author (weilerN):

I think it improves the readability of the nested structure.


// GetResourceMetrics retrieves metrics for multiple resources
func (pc *PrometheusMetricsClient) GetResourceMetrics(resources []scalertypes.Resource) (map[string]map[string]int, error) {
ctx, cancel := context.WithTimeout(context.Background(), pc.interval)
Collaborator:

So if it takes longer than an interval and the order in which we execute requests is always the same, are we at risk of failing every time with the same functions?

Collaborator Author (weilerN):

If a single function has a consistent error, it won't be scaled to zero (STZ).
I updated the code so that a single function's failure won't fail the STZ of all other resources, here - CR 5 - failure only if all goroutines are failing, log failure on the…

Comment on lines 121 to 133
resultCh := make(chan result, 1)
defer close(resultCh)

go func() {
	metrics, err := pc.getResourceMetrics(ctx, metricToWindowSizes)
	resultCh <- result{metrics: metrics, err: err}
}()

select {
case <-ctx.Done():
	return nil, errors.Wrap(ctx.Err(), "timeout waiting for resource metrics")
case res := <-resultCh:
	return res.metrics, res.err
Collaborator:

Why is this needed?

Collaborator Author (weilerN):

The purpose of this select is basically to limit the execution to the interval.
I added some comments in the code here - CR 5 - failure only if all goroutines are failing, log failure on the…

Comment on lines 142 to 148
wg := sync.WaitGroup{}

for metricName, queryTemplate := range pc.queryTemplates {
	windowSizeToResources := metricToWindowSizes[metricName]

	for windowSize, resourcesInWindowSize := range windowSizeToResources {
		wg.Add(1)
Collaborator:

Use a semaphore with a limited number of goroutines; there is no need to run all of them at once, and the Go scheduler will spend more time switching if the number of functions is large.

Collaborator Author (weilerN):

The number of goroutines is limited to the number of window sizes.
The default is 6 (link), so I would prefer to keep the WaitGroup usage as is.

}

var collectedErrors []error
collectorDone := make(chan struct{})
Collaborator:

If at least one error is enough to fail, why collect all of them? I suggest having one atomic var (firstErr or whatever) and using CompareAndSwap to set it. If at least one error occurred, fail. Why do we store each error?
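
A minimal sketch of that suggestion with illustrative names, using sync/atomic to keep only the first error observed across goroutines:

```go
import (
	"sync"
	"sync/atomic"
)

// runAll runs the given workers concurrently and returns the first error
// reported by any of them, without collecting every error into a slice.
func runAll(workers []func() error) error {
	var firstErr atomic.Pointer[error]
	var waitGroup sync.WaitGroup

	for _, work := range workers {
		waitGroup.Add(1)
		go func(work func() error) {
			defer waitGroup.Done()
			if err := work(); err != nil {
				// Only the first goroutine that reports an error wins the swap.
				firstErr.CompareAndSwap(nil, &err)
			}
		}(work)
	}
	waitGroup.Wait()

	if errPtr := firstErr.Load(); errPtr != nil {
		return *errPtr
	}
	return nil
}
```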

@weilerN weilerN (Collaborator Author) commented Jan 20, 2026

Hey all,

I’ve summarized the feedback into the following decision:

If there is a partial failure during execution (e.g., failing to query a specific window size, failing to parse a response, etc.), the error will be logged and the overall routine will continue.

The metrics client will return an error only if all metrics fail, or if there is a conflict between two metrics for the same resource and it cannot determine which one to use. We can further discuss how to handle the second case if needed.
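
A rough sketch of the first rule, with illustrative names (the conflict case is not covered here): partial failures are only logged, and an error is returned only when every query failed.

```go
import "fmt"

// summarizeQueryErrors returns nil when at least one query succeeded (partial
// failures are assumed to have been logged already) and an error only when
// all of the queries failed.
func summarizeQueryErrors(totalQueries int, queryErrors []error) error {
	if len(queryErrors) == 0 || len(queryErrors) < totalQueries {
		return nil
	}
	return fmt.Errorf("all %d metric queries failed: %v", totalQueries, queryErrors)
}
```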

@weilerN weilerN requested review from TomerShor and rokatyy January 20, 2026 09:14
@TomerShor TomerShor (Collaborator) left a comment
LGTM

@weilerN weilerN force-pushed the NUC-712-implement_prometheus_client branch from 4b3d22f to e89e0b1 on January 20, 2026 11:23
@rokatyy rokatyy (Collaborator) left a comment
LGTM

@TomerShor TomerShor merged commit a81e869 into v3io:development Jan 20, 2026
3 checks passed
TomerShor pushed a commit to nuclio/nuclio that referenced this pull request Jan 21, 2026
### 📝 Description
These code changes support the metricsClient changes in the scaler
repository.
This PR introduces the configuration additions and initializes the scaler
with the metrics client.

---

### 🛠️ Changes Made
- Add `MetricsClient` configuration to the platform config
- Update the `createAutoScaler` to use the metricsClient

---

### ✅ Checklist
- [ ] I updated the documentation (if applicable)
- [x] I have tested the changes in this PR

---

### 🧪 Testing
- Dev tests:
  - Backward compatibility: nothing changes with the new scaler and no configuration changes
  - Prometheus client: added the relevant configurations and validated that functions STZ as expected

---

### 🔗 References
- Ticket link: https://iguazio.atlassian.net/browse/NUC-713
- Design docs links:
https://iguazio.atlassian.net/wiki/spaces/PLAT/pages/627572831/Nuclio+STZ+using+Prometheus+client+-+HLD#Support-Prometheus-client-during-scaler-init
- Merge blocked by: v3io/scaler#84

---

### 🚨 Breaking Changes?

- [ ] Yes (explain below)
- [x] No


---

### 🔍️ Additional Notes
